Load Packages

Let’s load the tidyverse package.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ✔ purrr   0.3.4
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Import Data

Let’s import our data using read.csv.

data <- read.csv("data/chds6162_data.csv")
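
Base R’s read.csv and readr’s read_csv both read CSV files but return slightly different objects. The sketch below illustrates the difference using a tiny temporary file with made-up values (not the CHDS file); the column names and numbers are assumptions for illustration only.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (illustrative values only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,ht", "27,62", "33,64", "28,NA"), tmp)

# Base R: returns a plain data.frame
df_base <- read.csv(tmp)

# readr: returns a tibble; show_col_types = FALSE silences the column message
df_readr <- read_csv(tmp, show_col_types = FALSE)

class(df_base)
class(df_readr)
```

Either object works with ggplot, so the choice mostly affects printing and column-type handling.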

Scatterplot

We use geom_point to make a scatterplot. Let’s make a scatterplot that shows age on the x axis and height on the y axis.

ggplot(data = data,
       mapping = aes(x = age,
                     y = ht)) +
  geom_point()
## Warning: Removed 23 rows containing missing values (geom_point).

#another way you may see this

ggplot(data,aes(age,ht)) + geom_point()
## Warning: Removed 23 rows containing missing values (geom_point).
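
The warning above means some rows had a missing value in one of the plotted variables. The sketch below shows two ways to handle that, using R’s built-in airquality data as a stand-in (an assumption, since the CHDS file is not available here).

```r
library(ggplot2)

# airquality has missing values in Ozone and Solar.R
vars <- c("Ozone", "Solar.R")
n_missing <- sum(!complete.cases(airquality[, vars]))
n_missing  # rows ggplot would drop from a scatterplot of these two variables

# Option 1: drop incomplete rows yourself before plotting
aq_clean <- airquality[complete.cases(airquality[, vars]), ]
p_clean <- ggplot(aq_clean, aes(Solar.R, Ozone)) +
  geom_point()

# Option 2: keep the data as is and silence the warning
p_quiet <- ggplot(airquality, aes(Solar.R, Ozone)) +
  geom_point(na.rm = TRUE)
```

Both options draw the same points. The first makes the dropped rows explicit in your code.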

Histogram

We use geom_histogram to make a histogram. Let’s make a histogram of age.

ggplot(data = data, 
       mapping = aes(x = age)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

How does ggplot know what to plot on the y axis? It’s using the default statistical transformation for geom_histogram, which is stat = "bin".

If we add stat = "bin" we get the same thing. Each geom has a default stat.

ggplot(data = data, 
       mapping = aes(x = age)) +
  geom_histogram(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

#shorter way to write it:

ggplot(data,aes(age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
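
To see what the bin stat computes, one option is ggplot2’s layer_data function, which returns the data after the stat has been applied. This is a sketch on the built-in mtcars data (an assumption; any numeric variable would do).

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# layer_data() returns one row per bin, including the computed count
# that ggplot plots on the y axis
binned <- layer_data(p)
binned[, c("xmin", "xmax", "count")]
```

The count column is the y value ggplot generated for us; it was never a column in the original data.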

We can adjust the number of bins using the bins argument.

ggplot(data = data, 
       mapping = aes(x = age)) +
  geom_histogram(bins = 10)
## Warning: Removed 2 rows containing non-finite values (stat_bin).
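
The message printed earlier suggests binwidth as an alternative to bins. The sketch below sets the width of each bin directly rather than the number of bins, again on the built-in mtcars data (an assumption for illustration).

```r
library(ggplot2)

# Each bin spans 5 units of mpg
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# Check the width of every bin that ggplot computed
bin_widths <- with(layer_data(p), xmax - xmin)
bin_widths
```

Setting binwidth is often easier to interpret than bins when the variable has natural units, such as years of age.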

Bar Chart

There are two basic approaches to making bar charts, both of which use geom_bar.

Approach #1

Use your full dataset.

Only assign a variable to the x axis.

Let ggplot use the default stat transformation (stat = "count") to generate counts that it then plots on the y axis.

Approach #2

Wrangle your data frame before plotting, possibly creating a new data frame in the process

Assign variables to the x and y axes

Use stat = "identity" to tell ggplot to use the data exactly as it is
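
For Approach #2, ggplot2 also provides geom_col, which plots the values as they are without needing stat = "identity". The sketch below compares the two on a small made-up summary table (the names and values are hypothetical).

```r
library(ggplot2)

# A small pre-summarized data frame (made-up values for illustration)
summary_df <- data.frame(
  group = c("a", "b", "c"),
  mean_value = c(2.5, 4.0, 3.2)
)

# Approach #2 as described above
p_identity <- ggplot(summary_df, aes(x = group, y = mean_value)) +
  geom_bar(stat = "identity")

# Same chart with geom_col
p_col <- ggplot(summary_df, aes(x = group, y = mean_value)) +
  geom_col()
```

Both plots draw bars whose heights are the values in mean_value.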

Bar Chart v1

Let’s make a bar chart that shows the number of observations at each age.

ggplot(data = data, 
       mapping = aes(x = age)) +
  geom_bar()
## Warning: Removed 2 rows containing non-finite values (stat_count).

The default statistical transformation for geom_bar is stat = "count". Adding it explicitly gives us the same result as the previous plot.

ggplot(data = data, 
       mapping = aes(x = age)) +
  geom_bar(stat = "count") 
## Warning: Removed 2 rows containing non-finite values (stat_count).

#or
ggplot(data, aes(age)) + geom_bar() 
## Warning: Removed 2 rows containing non-finite values (stat_count).

Here’s what’s going on.
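
One way to picture the count stat is to do the counting by hand and compare it with what ggplot computes. This is a sketch on the built-in mtcars data (an assumption), using dplyr’s count and ggplot2’s layer_data.

```r
library(ggplot2)
library(dplyr)

# Counting by hand: one row per value of cyl, with n = number of rows
by_hand <- count(mtcars, cyl)

# Counting done by geom_bar's default stat
p <- ggplot(mtcars, aes(x = cyl)) +
  geom_bar()
by_ggplot <- layer_data(p)[, c("x", "count")]

by_hand
by_ggplot
```

The n column from dplyr and the count column from ggplot hold the same numbers, which are the bar heights.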

Bar Chart v2

It’s often easier to do our analysis work, save a data frame, and then use this to plot.

Let’s create a data frame of gestation length (this time in weeks) by the mother’s smoking habits.

gestation_by_smoke <- data %>% 
  mutate(gestation_w = gestation/7,
         smoke = case_when(
    smoke == 1 ~ "smokes now",
    smoke == 2 ~ "until now",
    smoke == 3 ~ "once did",
    smoke == 0 ~ "never")) %>% 
  group_by(smoke) %>% 
  summarize(gestation_w = mean(gestation_w,na.rm = TRUE)) %>%
  drop_na(smoke)

Then let’s use this data frame to make a bar chart. The stat = "identity" here tells ggplot to use the exact data points without any stat transformations.

ggplot(data = gestation_by_smoke, 
       mapping = aes(x = smoke, 
                     y = gestation_w)) +
  geom_bar(stat = "identity") 

color and fill

color

We add the color argument within aes so that the values of a variable are mapped to the color of the points.

Let’s color the points in our previous scatterplot by smoking habits.

data <- data %>% 
  mutate(smoke_lbl = case_when(
    smoke == 1 ~ "smokes now",
    smoke == 2 ~ "until now",
    smoke == 3 ~ "once did",
    smoke == 0 ~ "never"))

ggplot(data = data,
       mapping = aes(x = age,
                     y = ht,
                     color = smoke_lbl)) + 
  geom_point()
## Warning: Removed 23 rows containing missing values (geom_point).

#what if our color variable is continuous rather than labels?

ggplot(data,aes(age,ht,color = smoke)) + geom_point()
## Warning: Removed 23 rows containing missing values (geom_point).
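
Whether ggplot uses a gradient or distinct colors depends on the type of the mapped variable. The sketch below uses the built-in mtcars data (an assumption) to show the same numeric variable treated both ways.

```r
library(ggplot2)

# cyl is numeric, so ggplot uses a continuous color gradient
p_continuous <- ggplot(mtcars, aes(wt, mpg, color = cyl)) +
  geom_point()

# Wrapping it in factor() makes ggplot treat it as categories
p_discrete <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()
```

Creating a labelled version of the variable, as with smoke_lbl above, has the same effect as factor() and gives a more readable legend.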

Let’s try the same thing with our last bar chart (gestation_by_smoke).

ggplot(data = gestation_by_smoke, 
       mapping = aes(x = smoke, 
                     y = gestation_w,
                     color = smoke)) +
  geom_bar(stat = "identity") 

That didn’t work! For bars, color only changes the outline. Let’s try fill instead.

ggplot(data = gestation_by_smoke, 
       mapping = aes(x = smoke, 
                     y = gestation_w,
                     fill = smoke)) +
  geom_bar(stat = "identity")

Scales

color

We can change which colors the data is mapped to by using a scale_ function.

Let’s use a built-in palette like scale_color_viridis_d (d = discrete data).*

*FYI: The viridis scales provide colour maps that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness. The available options include viridis, magma, plasma, and inferno.

ggplot(data = data,
       mapping = aes(x = age,
                     y = ht,
                     color = smoke_lbl)) + 
  geom_point() +
  scale_color_viridis_d(option = "plasma")
## Warning: Removed 568 rows containing missing values (geom_point).

# shorter version
ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + geom_point() + scale_color_viridis_d(option = "plasma")
## Warning: Removed 568 rows containing missing values (geom_point).
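
The same family of scale_ functions covers continuous data and the fill aesthetic. The sketch below uses the built-in mtcars data (an assumption) with scale_color_viridis_c (c = continuous) and scale_fill_viridis_d.

```r
library(ggplot2)

# Continuous variable mapped to color: use the _c (continuous) variant
p_color <- ggplot(mtcars, aes(wt, mpg, color = hp)) +
  geom_point() +
  scale_color_viridis_c(option = "magma")

# Discrete variable mapped to fill: use the fill version of the scale
p_fill <- ggplot(mtcars, aes(factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_viridis_d(option = "plasma")
```

The scale name has to match both the aesthetic (color or fill) and the kind of data (discrete or continuous).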

Plot Labels

To add labels to our plot, we use labs. Let’s add a title argument to the last scatterplot.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) +
  geom_point() + 
  scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits")
## Warning: Removed 568 rows containing missing values (geom_point).

We can add a subtitle as well.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962")
## Warning: Removed 568 rows containing missing values (geom_point).

We can change the x and y axis labels using the x and y arguments, and the legend title using the color argument.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962", 
       x = "Age",
       y = "Height (inches)", 
       color = "Smoking habits")
## Warning: Removed 568 rows containing missing values (geom_point).
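
Because a ggplot is an ordinary R object, one way to avoid retyping the whole plot is to store it and add layers to the stored object. This is a sketch on the built-in mtcars data (an assumption); base_plot and labelled_plot are hypothetical names.

```r
library(ggplot2)

# Store the unlabelled plot once
base_plot <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point()

# Add labels (including an optional caption) to the stored plot
labelled_plot <- base_plot +
  labs(title = "Fuel economy by weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders",
       caption = "Data: mtcars (built into R)")

labelled_plot
```

Any later layer, such as a theme or facet, can be added to labelled_plot in the same way.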

Themes

To add a theme to a plot, we use the theme_ set of functions. There are several built-in themes. For instance, theme_minimal.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + 
  scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962", 
       x = "Age",
       y = "Height (inches)", color = "Smoking habits") + 
  theme_minimal()
## Warning: Removed 568 rows containing missing values (geom_point).
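
Built-in themes can be adjusted further. The sketch below, on the built-in mtcars data (an assumption), changes the base font size and then overrides a single element with theme.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  theme_minimal(base_size = 14) +    # larger base font size
  theme(legend.position = "bottom")  # override one element of the theme

p
```

The call to theme has to come after theme_minimal, because a complete theme added later would replace the earlier adjustment.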

There are also packages that give you themes you can apply to your plots.

ggthemes package

library(ggthemes)
#?ggthemes

We can then use a theme from this package (theme_excel_new) to make our plots look like those in the new version of Excel.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962", 
       x = "Age",
       y = "Height (inches)", 
       color = "Smoking habits") + 
  theme_excel_new()
## Warning: Removed 568 rows containing missing values (geom_point).

#what about APA?
library(jtools)
## Warning: package 'jtools' was built under R version 4.1.2
ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962", 
       x = "Age",
       y = "Height (inches)", 
       color = "Smoking habits") + 
  theme_apa() 
## Warning: Removed 568 rows containing missing values (geom_point).

Facets (my favorite feature when making graphs)

You can make small multiples by adding just one line of code using the facet_wrap function. Let’s make a separate plot for each label in the smoking variable.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962", 
       x = "Age",
       y = "Height (inches)", 
       color = "Smoking habits") + 
  theme_apa() + 
  facet_wrap(~smoke_lbl)
## Warning: Removed 568 rows containing missing values (geom_point).
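
facet_wrap also takes arguments that control the layout of the panels. The sketch below uses the mpg data that ships with ggplot2 (an assumption) to set the number of columns and let each panel have its own y axis.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class,
             ncol = 4,           # arrange the panels in 4 columns
             scales = "free_y")  # each panel gets its own y axis range

p
```

Free scales make each panel easier to read but harder to compare across panels, so the default shared scales are often the safer choice.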

We can do this for any type of figure. Let’s make multiple histograms for age by smoking habits.

ggplot(data = data, 
       mapping = aes(x = age)) +
  geom_histogram() +
  theme_apa() + 
  facet_wrap(~smoke_lbl) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).

Another example:

ggplot(data = data,
       mapping = aes(x = age,
                     y = ht,
                     color = smoke_lbl)) + 
  geom_point() +
  scale_color_viridis_d(option = "magma") +
  labs(title = "Association Between Age and Height",
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962",
       x = "Age",
       y = "Height (inches)",
       color = "Smoking Habits") +
  theme_economist() +
  facet_wrap(~ed)
## Warning: Removed 568 rows containing missing values (geom_point).

Save Plots

RMarkdown: just knit your file and your plots will show up as part of your HTML, Word, or PDF document.

On its own: use the ggsave function. By default, ggsave saves the last plot you made, so you can call it right after each graph you want to save.

ggplot(data,mapping = aes(age,ht,color = smoke_lbl)) + 
  geom_point() + scale_color_viridis_d(option = "plasma") + 
  labs(title = "Mother's age and height by smoking habits", 
       subtitle = "Data from the Child Health and Development Studies 1961 and 1962", 
       x = "Age",
       y = "Height (inches)", 
       color = "Smoking habits") + 
  theme_apa() + 
  facet_wrap(~smoke_lbl)
## Warning: Removed 568 rows containing missing values (geom_point).

ggsave("plots/plot_example.png")
## Saving 7 x 5 in image
## Warning: Removed 568 rows containing missing values (geom_point).
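
ggsave can also be given a stored plot and explicit dimensions instead of relying on the last plot and the default size. This is a sketch that writes to a temporary folder, using the built-in mtcars data (an assumption); out_file is a hypothetical name.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()

# Save a specific plot object with explicit size (in inches) and resolution
out_file <- file.path(tempdir(), "plot_example.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

Passing plot = p makes the script less fragile, because the saved file no longer depends on which plot happened to be drawn last.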

We can save our plot to other formats as well. PDF is a great option.


```r
ggsave("plots/example.pdf")
## Saving 7 x 5 in image
## Warning: Removed 568 rows containing missing values (geom_point).